get entire test suite working with offload by evgunter · Pull Request #962 · imbue-ai/mngr

evgunter · 2026-03-25T17:22:54Z

it did some suspicious thing where it replaced all the mng llmdb calls to "speed them up" so that should definitely be checked

also was there some reason we didn't have a .dockerignore before? i vaguely remember something like this

doesn't actually switch over the local tests to default to offload--i want to make sure we have container retries before doing that

Summary

Configure test-results as a shared worktree symlink so offload baselines are shared
Update .offload-base-commit to tip of main (36f9e3b6d)
Add .dockerignore to prevent Modal upload races from .git/, __pycache__/, etc.
Change sandbox_init_cmd to use git apply + git add -A (keeps index in sync without the strictness of --index)
Add mng-test-runner non-root user to Dockerfile (Modal ignores USER directive, so it's unused for now)
Monkeypatch atomic_write in OSError test since Modal always runs as root
Auto-invalidate offload image cache when .offload-base-commit, Dockerfile, or offload-modal.toml changes
Call conversation_db.py directly instead of mng llmdb in chat.sh (eliminates ~10s startup per call)
Fix tunnel accept loop shutdown race (reduce accept timeout, increase join timeout)
Fix leaked processes: add AgentCreator.close() and call it in all creation tests, fix kanpan dispatch test, coordinator tests, forwarding server tests
Terminate mng events background processes in all stream manager CG tests
Wait for unison to start before stopping in pair test
Use file:///nonexistent-repo instead of https://github.com/test/repo in tests (fails instantly, no network hang)

Test plan

just test-offload passes with 0 failures -- 20 consecutive clean runs (136,660 test executions, 0 failures)
Verify worktree symlink: new agents from ~/mng get test-results/ symlinked
Verify cache invalidation: changing .offload-base-commit triggers image rebuild without --no-cache

🤖 Generated with Claude Code

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

The work_dir_extra_paths table header was absorbing the default_destroyed_host_persisted_seconds scalar field. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

- Update .offload-base-commit to tip of origin/main (36f9e3b) so the base image includes xxd and excludes deleted files like benchmark_branches.sh - Use git apply --index in sandbox_init_cmd so the git index stays in sync with the working tree (fixes git ls-files listing deleted files) - Add mng-test-runner user to Dockerfile for future use running tests as non-root (some tests assume filesystem permission checks work correctly) Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Active agents modify .git/worktrees/*/FETCH_HEAD during git operations, which races with Modal's image upload and causes "modified during build process" errors. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Git state is delivered via a tarball (current.tar.gz), not the live .git dir. Files under .git/ are actively modified by running agents, which races with Modal's upload. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

- Skip test_write_cli_completions_cache_handles_oserror when running as root (root bypasses filesystem permission checks, making the test invalid) - Handle missing files in find_bash_scripts_without_strict_mode (git ls-files can list files that were deleted by git apply but not removed from the index) Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Hash .offload-base-commit and the Dockerfile to detect when the cached image is stale. Previously, changing the base commit required manually passing --no-cache or deleting .offload-image-cache. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

git apply --index requires exact match between working tree and index, which fails when the Dockerfile's COPY and tarball extraction leave minor discrepancies. Instead, apply the patch normally then git add -A to update the index to match the working tree. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

The Docker image only needs current.tar.gz from the build context. Exclude __pycache__, .venv, test artifacts, and other directories that are modified by concurrent processes during Modal's upload. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

- Reduce _ACCEPT_TIMEOUT_SECONDS from 1.0s to 0.2s so the tunnel accept loop checks the shutdown event more frequently - Increase test join timeout from 3s to 10s for slow Modal containers - Add UnisonSyncer.wait_for_started() and use it in test_pair_files_syncs_git_state_before_starting so the unison binary is actually invoked before the test stops the syncer Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

test_dispatch_immediate_custom_command_does_not_mark was using "mng connect $MNG_AGENT_NAME" which blocks forever, leaking a subprocess via the ThreadPoolExecutor. Use "true" instead since the test only verifies dispatch routing (marks not set), not command execution. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

- Terminate mng events --follow processes before CG exit in test_stream_manager_mixed_local_and_remote to avoid 4s timeout flake - Bump ChatTestEnv default timeout from 10s to 30s for slow containers Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

- Add AgentCreator.wait_for_completion() and call it in tests that start_creation() with an invalid repo, so the background git clone finishes before session cleanup runs - Add process_manager.terminate_all() to coordinator tests that spawn handler subprocesses via process_tasks() Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

The mng CLI has heavy startup cost (~10s on Modal containers) due to Python interpreter + Click + full module graph initialization. chat.sh was calling mng llmdb for simple SQLite operations that only need sqlite3 and sys. Provision conversation_db.py alongside chat.sh and call it directly with python3. This eliminates the timeout failures in offload. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

@pure

Problem: _check_local_symlink_state was decorated with @pure, but it performs filesystem I/O (is_symlink, resolve, exists) and its output depends on filesystem state rather than only on its arguments. Fix: Removed the @pure decorator since the function is not pure. All other @pure functions in host.py are genuinely pure. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

# Conflicts: # libs/mng/imbue/mng/hosts/host.py

Modal's dockerignore may not match nested __pycache__/ dirs with a bare pattern. Use **/__pycache__/ to be explicit. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Use file:///nonexistent-repo instead of https://github.com/test/repo so git clone fails instantly rather than hanging on network. Also add wait_for_completion calls to tests that fire-and-forget agent creation. Also fix .dockerignore to use ** prefix for nested directory matching. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

All 5 tests that use with manager._cg: and call _handle_discovery_line now terminate background mng events processes before exiting the CG. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

- Add chown to sandbox_init_cmd so mng-test-runner owns /code/mng - Use runuser to run pytest as mng-test-runner - Remove skipif(root) from completion_writer_test since tests no longer run as root - Remove exists() band-aid from find_bash_scripts_without_strict_mode since git add -A keeps the index in sync Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

sandbox_init_cmd is baked into the image, so changes to it need to invalidate the cache. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Background threads from start_creation() leak git clone processes and log errors that pollute other tests' captured output. Add close() to join all threads, and call it in every test that triggers creation. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

…cked This test calls start_creation() but never waited for the background thread to finish, leaking a git clone process. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

# Conflicts: # .gitignore # libs/mng/imbue/mng/hosts/host.py

joshalbrecht · 2026-03-27T01:43:08Z

apps/minds/imbue/minds/forwarding_server/agent_creator_test.py

+    creator.wait_for_completion(agent_id, timeout=10.0)
+
+    creator.wait_for_completion(agent_id, timeout=10.0)


that looks weirdly duplicated

joshalbrecht · 2026-03-27T01:44:10Z

apps/minds/imbue/minds/forwarding_server/backend_resolver_test.py


+        for process in manager._events_processes.values():
+            process.terminate()
+


these seem suss--isn't the whole point of the concurrency group to make sure that this stuff happens automatically?

should i bump the timeout for the concurrency group's garbage collection, then?

joshalbrecht · 2026-03-27T01:44:57Z

apps/minds/imbue/minds/forwarding_server/ssh_tunnel.py

 _SELECT_TIMEOUT_SECONDS: Final[float] = 1.0

-_ACCEPT_TIMEOUT_SECONDS: Final[float] = 1.0
+_ACCEPT_TIMEOUT_SECONDS: Final[float] = 0.2


why lower this? 200ms feels kinda tight for anything

joshalbrecht · 2026-03-27T01:46:20Z

libs/mng/imbue/mng/hosts/host.py

+    def _rsync_paths(
+        self,
+        source_host: OnlineHostInterface,
+        source_path: Path,
+        target_path: Path,


this diff must be weird and stacked on another PR?

joshalbrecht · 2026-03-27T01:47:06Z

libs/mng_llm/imbue/mng_llm/resources/chat.sh

-                    _new_conversation_id=$(mng llmdb poll-new "$_LLM_DB" "$_max_rowid")
+                    _new_conversation_id=$(python3 "$_CONV_DB" poll-new "$_LLM_DB" "$_max_rowid")


I struggle to believe that these are right... pretty sure this is just going to break my stuff

Why did we need this?

joshalbrecht · 2026-03-27T01:47:37Z

hmmm, some of this seems pretty suspicious...

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

The merge with origin/main reintroduced the old mng-prefixed directories alongside the new mngr-prefixed ones. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

These files came from origin/main after the rename and still had the old module path. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

.mng/ was renamed to .mngr/ by the migration but the old directory was re-added during the merge with origin/main. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

The ratchet catches any 'mng' not followed by 'r' in file contents. The old directories are already removed from git and disk. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

The constant controls how often the accept loop checks the shutdown event, not how long clients have to connect. The old name was misleading. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Our branch had stale merge artifacts: old execute_idempotent_command/ execute_stateful_command split, old _run_shell_command_with_transient_retry method, and more granular SSH error handling. All of these were already refactored on main. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

# Conflicts: # libs/mngr/imbue/mngr/hosts/host.py

The previous conflict resolution accidentally deleted read_file, write_file, read_text_file, write_text_file, and _get_file_mtime. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Modal ignores USER directives so this user was never used. The OSError test is handled via monkeypatch instead. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

evgunter and others added 12 commits March 24, 2026 16:25

Merge branch 'mng/update-ty' into mng/offload

3c9b514

Add test-results as shared worktree symlink for offload baselines

b6721d8

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Fix TOML ordering: move scalar fields before table headers

b403fcb

The work_dir_extra_paths table header was absorbing the default_destroyed_host_persisted_seconds scalar field. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

evgunter changed the title ~~Get offload test suite working~~ get entire test suite working with offload Mar 25, 2026

evgunter and others added 17 commits March 25, 2026 12:51

Revert chat script timeout to 10s -- needs real investigation

206ba03

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Merge remote-tracking branch 'origin/main' into mng/offload

3dcffba

# Conflicts: # libs/mng/imbue/mng/hosts/host.py

Merge remote-tracking branch 'origin/main' into mng/offload

580d3be

Merge remote-tracking branch 'origin/main' into mng/offload

1fcb230

Merge remote-tracking branch 'origin/main' into mng/offload

1a544d9

Merge remote-tracking branch 'origin/main' into mng/offload

333d26f

Merge branch 'mng/custom-wt-copy' into mng/offload

f05c3a0

# Conflicts: # libs/mng/imbue/mng/hosts/host.py

Merge branch 'mng/custom-wt-copy' into mng/offload

b521be8

Use ** glob prefix in .dockerignore for nested directory matching

1b6912e

Modal's dockerignore may not match nested __pycache__/ dirs with a bare pattern. Use **/__pycache__/ to be explicit. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Terminate events processes in all stream manager CG tests

173f6bb

All 5 tests that use with manager._cg: and call _handle_discovery_line now terminate background mng events processes before exiting the CG. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Include offload-modal.toml in image cache key

a9f8d6a

sandbox_init_cmd is baked into the image, so changes to it need to invalidate the cache. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

evgunter and others added 4 commits March 26, 2026 13:30

Add missing wait_for_completion to custom_name agent creator test

3851ee0

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Add close() to test_agent_creator_get_log_queue_returns_queue_for_tra…

3b61ca4

…cked This test calls start_creation() but never waited for the background thread to finish, leaking a git clone process. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Merge remote-tracking branch 'origin/main' into mng/offload

baea1fa

# Conflicts: # .gitignore # libs/mng/imbue/mng/hosts/host.py

evgunter marked this pull request as ready for review March 26, 2026 23:34

joshalbrecht reviewed Mar 27, 2026

View reviewed changes

evgunter and others added 19 commits March 27, 2026 02:48

Rename mng -> mngr

8d0dab8

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Merge remote-tracking branch 'origin/main' into mng/offload

ea87416

Remove old libs/mng_* directories left over from rename merge

f8d5ef3

The merge with origin/main reintroduced the old mng-prefixed directories alongside the new mngr-prefixed ones. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Fix old imbue.mng_claude_mind imports in vendor_mng files

e22657b

These files came from origin/main after the rename and still had the old module path. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Remove old vendor_mng files (replaced by vendor_mngr)

9319efe

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Remove old .mng/ directory and fix MNG_AGENT_STATE_DIR in chat.sh

050745c

.mng/ was renamed to .mngr/ by the migration but the old directory was re-added during the merge with origin/main. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Remove old scripts/mng_log.sh symlink (replaced by mngr_log.sh)

7db32a6

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Remove old mng directories re-added by git add -A

848156b

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Add old mng directories to .gitignore to prevent re-addition

aba6f8e

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Remove old mng gitignore entries that trigger name ratchet

596aac5

The ratchet catches any 'mng' not followed by 'r' in file contents. The old directories are already removed from git and disk. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Remove duplicate wait_for_completion call

cddc227

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Rename _ACCEPT_TIMEOUT_SECONDS to _SHUTDOWN_POLL_SECONDS

c5d7d60

The constant controls how often the accept loop checks the shutdown event, not how long clients have to connect. The old name was misleading. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Remove accidentally committed autofix artifacts

66d9e38

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Merge remote-tracking branch 'origin/main' into mng/offload

faffd74

# Conflicts: # libs/mngr/imbue/mngr/hosts/host.py

lint

3a0de5c

Reset host.py to origin/main -- fix deleted methods from bad merge

07aaa0a

The previous conflict resolution accidentally deleted read_file, write_file, read_text_file, write_text_file, and _get_file_mtime. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Merge remote-tracking branch 'origin/main' into mng/offload

1c2971e

Remove unused mngr-test-runner user from Dockerfile

f50c8b1

Modal ignores USER directives so this user was never used. The OSError test is handled via monkeypatch instead. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

get entire test suite working with offload#962

get entire test suite working with offload#962
evgunter wants to merge 59 commits intomainfrom
mng/offload

evgunter commented Mar 25, 2026 •

edited

Loading

Uh oh!

joshalbrecht Mar 27, 2026

Uh oh!

joshalbrecht Mar 27, 2026

Uh oh!

evgunter Mar 27, 2026

Uh oh!

joshalbrecht Mar 27, 2026

Uh oh!

joshalbrecht Mar 27, 2026

Uh oh!

joshalbrecht Mar 27, 2026

Uh oh!

joshalbrecht commented Mar 27, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

		creator.wait_for_completion(agent_id, timeout=10.0)

		creator.wait_for_completion(agent_id, timeout=10.0)


		for process in manager._events_processes.values():
		process.terminate()

		_new_conversation_id=$(mng llmdb poll-new "$_LLM_DB" "$_max_rowid")
		_new_conversation_id=$(python3 "$_CONV_DB" poll-new "$_LLM_DB" "$_max_rowid")

Conversation

evgunter commented Mar 25, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

Test plan

Uh oh!

joshalbrecht Mar 27, 2026

Choose a reason for hiding this comment

Uh oh!

joshalbrecht Mar 27, 2026

Choose a reason for hiding this comment

Uh oh!

evgunter Mar 27, 2026

Choose a reason for hiding this comment

Uh oh!

joshalbrecht Mar 27, 2026

Choose a reason for hiding this comment

Uh oh!

joshalbrecht Mar 27, 2026

Choose a reason for hiding this comment

Uh oh!

joshalbrecht Mar 27, 2026

Choose a reason for hiding this comment

Uh oh!

joshalbrecht commented Mar 27, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

evgunter commented Mar 25, 2026 •

edited

Loading